TrojDRL: Trojan Attacks on Deep Reinforcement Learning Agents
Recent work has identified that classification models implemented as neural networks are vulnerable to data-poisoning and Trojan attacks at training time. In this work, we show that these training-time vulnerabilities extend to deep reinforcement learning (DRL) agents and can be exploited by an adversary with access to the training process. In particular, we focus on Trojan attacks that augment the function of reinforcement learning policies with hidden behaviors. We demonstrate that such attacks can be implemented through minuscule data poisoning (as little as 0.025% of the training data) and in-band reward modification that does not affect the reward on normal inputs. The policies learned with our proposed attack approach perform imperceptibly similarly to benign policies but deteriorate drastically when the Trojan is triggered, in both targeted and untargeted settings. Furthermore, we show that existing Trojan defense mechanisms for classification tasks are not effective in the reinforcement learning setting.
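The poisoning mechanism described above can be illustrated with a minimal sketch. The function below is hypothetical and not from the TrojDRL codebase: it stamps a trigger pattern onto a tiny fraction of states in a training batch, forces the attacker's target action, and assigns an in-band reward (one within the environment's normal reward range), leaving all other transitions untouched.

```python
import random

def poison_batch(states, actions, rewards, trigger, target_action,
                 poison_rate=0.00025):
    """Illustrative targeted-poisoning step for a DRL training batch.

    With probability poison_rate (0.025% here, matching the fraction
    reported in the abstract), stamp the trigger onto the state, force
    the target action, and set an in-band reward. Clean transitions
    pass through unchanged, so the reward on normal inputs is unaffected.
    """
    poisoned = []
    for s, a, r in zip(states, actions, rewards):
        if random.random() < poison_rate:
            s = [sv + tv for sv, tv in zip(s, trigger)]  # stamp trigger pattern
            a = target_action  # hidden behavior the Trojan should elicit
            r = 1.0            # in-band reward reinforcing the target action
        poisoned.append((s, a, r))
    return poisoned
```

In an untargeted variant, the forced action would instead be chosen to degrade return (e.g., a random or adversarially selected action) rather than a single fixed target.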
TrojDRL: evaluation of backdoor attacks on deep reinforcement learning
We present TrojDRL, a tool for exploring and evaluating backdoor attacks on deep reinforcement learning agents. TrojDRL exploits the sequential nature of deep reinforcement learning (DRL) and considers different gradations of threat models. We show that untargeted attacks on state-of-the-art actor-critic algorithms can circumvent existing defenses built on the assumption that backdoors are targeted. We evaluated TrojDRL on a broad set of DRL benchmarks and showed that the attacks require poisoning as little as 0.025% of the training data. Compared with existing work on backdoor attacks against classification models, TrojDRL provides a first step towards understanding the vulnerability of DRL agents.